This notebook describes the procedure to determine the data for two
parts of the openEO Platform Use Case 8: “Fractional Canopy
Cover”:
- The local test site sampling procedure to decide the
test sites within the study area for which the very high resolution data
(VHR) will be acquired.
- The local validation data sampling necessary for the
quality assessment of the final map.
The test site sampling is the first step of the use
case as it delineates suitable test sites as training for a random
Forest regression approach. Given the restriction of cost and the wide
area to cover in the UC the study sites need to be chosen carefully. We
chose to search for 150 test sites with an extent of 16ha each spread
equally across the whole study area. Due to the fact that this step had
to be done in a very early stage of the project the scripts rely on
local versions of the EO layers needed for the computation. This
comprises both a heavy computation of 16ha Polygons over the whole area
and the implementation of a personalized scoring.
The validation data sampling has been conducted in
parallel to ensure that all the locally sampled datasets are present in
an early stage to later focus on the regression and prediction.
The EO-layers used in this part of the UC are:
- The Copernicus Forest High Resolution Layer Tree
Cover Density
- The Copernicus Forest High Resolution Layer Dominant
Leaf Type
- The 2018
CORINE Land Cover -CLC- data set
This long notebook has slightly different recurrent
terminologies:
- Study Area: The Whole bounding box of roughly 1 Mio
sqkm.
- Tile: Tiling of the Copernicus forest HRL.
- Potential test sites: All 16ha Polygons in the Study
area.
- Grid: Gridded (spatial) representation of the potential test
sites.
- Test sites: The final selection of the 150 test areas for the
VHR data request.
NOTE: This has workflow has been implemented locally and is therefore only partially reprodicible on other distribution. The intermediary data sets, however, are included in /resources/UC8/Local/
The whole workflow was created and is working using R 4.1.2 on a
Ubuntu 18.04 bionic distribution.
Prior to the actual workflow the necessary R libraries need to be
loaded. If they are not yet present on the distribution they have to be
installed using the install_packages() function.
library("raster")
library("terra")
library("sf")
library("mapview")
library("dplyr")
library("tidyr")
library("purrr")
library("stars")
library("starsExtra")
library("stringr")
library("tictoc")
library("readr")
library("tibble")
library("parallel")
library("ggplot2")
library("exactextractr")
library("leaflet")
First of all the Netcdf files with the study site extents are loaded:
eusalpbb<-st_read("resources/UC8/Local/alpinespace_eusalp_boundingbox.shp")
eusalpbb.sites<-st_make_grid(eusalpbb,n = 13) %>%
st_as_sf() %>%
mutate(ID=c(1:nrow(.)))
eusalpbb.laea<-eusalpbb %>% st_transform(crs=3035) # LAEA projection applied to fasten some processes
Next, the data sets must be imported. This comprises both the study
area as well as the HRL and CLC files. Since this approach was created
before the collections were available on the back-ends in order to have
a quick VHR datacube the files are loaded locally.
First we import the CLC as stars proxy object. We can do that just once
as it is not tiled or divided otherwise
r.clc <-read_stars("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/01_Data/CORINE_2018_250m_Raster/DATA/U2018_CLC2018_V2020_20u1.tif",proxy=T)
clc_classes <- read_csv("resources/UC8/Local/clc_classes_final.csv",show_col_types = F) # The connection between the CLC numeric values and classes
Next we import a csv file containing the tiled Copernicus HRL data as stored locally in the file system
hrl.join <- read_csv("resources/UC8/Local/hrl_table.csv",show_col_types = F)
These sites have the extent of 16ha as defined for the 150 test sites within the study site. Therefore several grids are constructed with polygons of 16ha throughout the study area leading to a total of 3.56 Mio Polygons to be analyzed. The grids are calculated based on the Copernicus HRL tiling for convenience
for (i in 1: nrow(hrl.join)) {
data <- hrl.join$TCD[[i]] # Load the Tree Cover Density representing the extent
outfile <- paste0(gridloc,hrl.join$Tile[[i]],"_grid.nc") # Name of the file
r1 <- read_stars(data,proxy=T) # Read data as stars proxy to save time
r.grid <- st_make_grid(st_bbox(r1),400) %>% st_transform(crs=3035) # Make Grid across the raster with 400x400m (16ha)
r.grid.f<- r.grid[st_within(r.grid,eusalpbb) %>% as.numeric() %>% {which(.==1)}] # CHeck that it is in the actual extent
if(length(r.grid.f)>0) st_write(r.grid.f,outfile) # Write the Grid
toc()
}
Once the process is finished and the grids exported they are the base
for the computation of the metrics.
One exemplary grid is illustrated here by the centroids of each Polygon
(for faster plotting):
example<-st_read("resources/UC8/Local/E38N22_grid.nc")
## Reading layer `E38N22_grid' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/SRR3_notebooks/notebooks/resources/UC8/Local/E38N22_grid.nc'
## using driver `netCDF'
## Simple feature collection with 39315 features and 0 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3805600 ymin: 2231200 xmax: 3900000 ymax: 2300000
## Projected CRS: ETRS89-extended / LAEA Europe
mapview(st_centroid(example),layer.name="Polygon Centroids")
After all the potential test sites have been delineated the metrics
for the single potential test sites have to be determined. This is done
by calculating the metrics of the HRL and CLC Layers for each individual
Polygon. This step requires a lot of data to be loaded individually
while the metrics are being calculated. In order to use ONE function in
a parallelized way, the whole metrics generation is done using the
calcMetric function below
# R1 object is the tree cover density stars proxy object
# R2 parameter is the tree type stars proxy object
# R3 object is the Corine Lanc Cover stars proxy object
# j is an iterator looping through the grids. This is more efficient than a dedicated for-loop
calcMetric=function(r1,r2,r3,j){
# Crop Stars proxy objects
pol <- r.grid[j,]
cr.tcd <- st_crop(r1,pol)
cr.trtype <- st_crop(r2,pol)
cr.clc <- st_crop(r3,pol)
# Calculate the TreeDensity
cr.tcd2 <- st_as_stars(cr.tcd)
tcd.mn <- as.numeric(as.character(pull(cr.tcd2)))
tcd.mn <- round(mean(tcd.mn,na.rm=T))
if(tcd.mn<10 | tcd.mn>90) return(NA)
# Calculate the Tree Type / Density and Dominance
cr.trtype2 <- st_as_stars(cr.trtype)
tree.type <- table(as.numeric(as.character(pull(cr.trtype2))))
if(length(tree.type)<2) return(NA)
brd<-as.numeric(tree.type[which(as.numeric(names(tree.type))==1)]) / sum(tree.type)
if(length(brd)==0) brd=0
con<-as.numeric(tree.type[which(as.numeric(names(tree.type))==2)]) / sum(tree.type)
if(length(con)==0) con=0
dominance <- (brd-con)/(brd+con)
dominant_type <- ifelse(dominance>0,"BroadLeaved","Coniferous")
dominance_abs <- round(abs(dominance)*100)
forest.perc <- round((1-(brd-con))*100)
if(forest.perc<10 | forest.perc>90) return(NA)
# Calculate the CLC Objects
cr.clc2 <- st_as_stars(cr.clc)
clc.res <- pull(cr.clc2) %>% as.character %>% as.numeric %>% table %>% as_tibble()
colnames(clc.res)=c("Raster_ID","n")
clc.res2 <- clc.res %>% mutate(Raster_ID=as.numeric(Raster_ID))
clc.res3 <- left_join(clc_classes3,clc.res2,by = "Raster_ID") %>% na.omit %>% select(L1_class,Target,n)
if(any(clc.res3=="Forests")){
if(nrow(clc.res3)==1) return(NA)
clc.nclasses<-nrow(clc.res3)
clc.forest <- as.numeric(round((filter(clc.res3,Target=="Forests")$n / sum(clc.res3$n))*100) )
if(clc.forest<10 | clc.forest>90) return(NA)
if(clc.nclasses<2 | clc.nclasses==5) return(NA)
clc.other <- filter(clc.res3,Target!="Forests")
clc.other <- arrange(clc.other,n)
clc.other <- clc.other$Target[1]
} else { return(NA) }
# Combine the results in one dataframe
b1<-bind_cols(DomAbs=dominance_abs,DomType=dominant_type,
HRLPerc=forest.perc,Density=tcd.mn,
CLCclasses=clc.nclasses,CLCperc=clc.forest,CLCother=clc.other)
return(b1)
}
The function needs to be applied to each polygon through a for
loop.
ATTENTION: This process is heavily parallelized with
the parallel package. Please pay attention to the
deployable resources in your OS.
for(i in 1:nrow(hrl.join)){
# Read and Format the Raster
r1<-read_stars(hrl.join$TCD[[i]],proxy=T)
r2<-read_stars(hrl.join$DLT[[i]],proxy=T)
r.grid <- st_read(hrl.join$Grid[[i]],quiet = T) %>% st_transform(crs=st_crs(r1)) %>% st_as_sf
r.grid$ID <- c(1:length(r.grid))
r.grid$Area <- st_area(r.grid)
ith <-as.list(c(1:nrow(r.grid)))
# Parallelization
no_cores <- detectCores()-1
clust <- makeCluster(no_cores,outfile="Log/Log.txt")
clusterExport(cl=clust,varlist=c("r.grid","r1","r2","r3","ith","calcMetric","clc_classes3"))
a<-clusterEvalQ(
cl=clust,
c(library("stars"),library("dplyr"),library("tidyr"),library("sf"),library("tibble")))
pl<-parLapply(cl=clust, ith, function(x) calcMetric(r1,r2,r3,x))
stopCluster(clust)
# Tidy the parallelization result
nisna<-which(!is.na(pl))
r.grid2<-r.grid %>%
mutate(Data=pl) %>%
slice(nisna) %>%
unnest(Data)
# Save the Result
out2<-paste0(shapeloc,hrl.join$Tile[[i]],"_shapefile_select_noscore.nc")
if(nrow(r.grid)>0) st_write(r.grid2,out2)
}
This is the structure of the final metrics dataset:
data<-st_read("resources/UC8/Local/E38N22_shapefile_select_noscore.nc")
## Reading layer `E38N22_shapefile_select_noscore' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/SRR3_notebooks/notebooks/resources/UC8/Local/E38N22_shapefile_select_noscore.nc'
## using driver `netCDF'
## Simple feature collection with 199 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3810400 ymin: 2269200 xmax: 3900000 ymax: 2300000
## Projected CRS: ETRS_1989_LAEA
data
While this plot represents the potential test sites colored by the Tree Cover Density from the HRL:
example.bbox<-st_as_sfc(st_bbox(example))
mapview(example.bbox,layer.name="Example Grid Boundary")+
mapview(data,zcol="HRLPerc",layer.name="Tree Cover Density")
Based on the Metrics a scoring system needs to be implemented. This
means that the target variables need to be defined based on a
universally valid scheme to determine which potential sites are actually
most important for the detection of the FCC target variable. There are
three scoring schemes:
- Tree Density: The tree density of the 16ha site. The
score is highest when approx. 50% as there are both forested and
non-forested Pixels for training
- CLC Classes: 2 to three classes preferred. Once class
would be only forest and more than three might introduce noises or
difficulties to discriminate between classes.
- Tree Dominance. The more dominant one tree type the
better. This reduces the probability of mixed forests
# Tree density and forest percentage
Values<-c(0:100)
Score <-c(0,rep(1:5,each=10),rep(5:1,each=10))
score_density <-bind_cols(Values=Values,Score=Score)
# CLC Classes
Values<-c(1:10)
Score <-c(1,5,5,3,1,0,0,0,0,0)
score_nclc <-bind_cols(Values=Values,Score=Score)
# Tree Dominance
Values<-c(0:100)
Score <-c(0,rep(1:5,each=20,len=100))
score_dom <-bind_cols(Values=Values,Score=Score)
In total five scores are used for the final decision. While the CLC classes and the Tree Dominance are used once on the respective layers the Tree Density scoring scheme is used for three separate layers: HRL tree percentage, CLC tree percentage and HRL tree density. An overview can be seen in the following picture:
finaltable<- readRDS("resources/UC8/Local/scoretable.rds")
ggplot(finaltable,aes(x=Values,y=Score))+
geom_point(alpha=0.2)+
geom_line()+
facet_wrap(~Type,scales="free_x")+
ggtitle("Score crierions for the VHR data test site derivation")
Now the scores are calculated iteratively for each of the polygons
and appended to the overall shapefile. Additionally, the
eusalpbb_sites extent is reprojected to the matching CRS in
order to calculate the scores at the right location. Given the huge
file size the files are loaded locally
data.noscore <- list.files("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/02_Products/01_StudyArea/AllShapes/",full.names=T)
polymaster <- st_read(data.noscore[1],quiet=T)
tls <- st_transform(eusalpbb.sites,st_crs(polymaster))
all<-list()
for(i in 1:length(data.noscore)){
# Read the Polygon(s)
poly<-st_read(data.noscore[i],quiet=T)
if(nrow(poly)==0) next
# Check if the Polygons are in the EUSALP boundng box
within <- as.numeric(st_within(poly,tls))
# Filter if they are not in the Extent
poly2<-poly %>%
mutate(inTile=within) %>%
filter(!is.na(inTile)) %>%
na.omit()
# Calculate the scores
poly2.score<-poly2 %>%
mutate(ScoreDom = map_dbl(DomAbs, function(x) unlist(score_dom[which(score_dom$Values==x),2]))) %>%
mutate(ScorePerc1 = map_dbl(HRLPerc, function(x) unlist(score_density[which(score_density$Values==x),2]))) %>%
mutate(ScorePerc2 = map_dbl(CLCperc, function(x) unlist(score_density[which(score_density$Values==x),2]))) %>%
mutate(ScoreCLC = map_dbl(CLCclasses,function(x) unlist(score_nclc[which(score_nclc$Values==x),2]))) %>%
mutate(ScoreDens = map_dbl(Density, function(x) unlist(score_density[which(score_density$Values==x),2]))) %>%
mutate(ScoreAll = ScoreDom+ScorePerc1+ScorePerc2+ScoreCLC+ScoreDens)
# Add to the list
all[[i]]<-poly2.score
}
# Combine ALL Polygons to one sf file
# This process may take a while. Approximately 870000 Shapefiles were produced
comb<-sf::st_as_sf(data.table::rbindlist(all)) %>% arrange(inTile)
This is the result of the computation with the single scores appeneded as columns together with the summary - the total scoring by Polygon. This file is loaded from local as it has a total size of approximately 1Gb
comb<-st_read("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/02_Products/01_StudyArea/SuitableSitesVHR_all.nc")
## Reading layer `SuitableSitesVHR_all' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/02_Products/01_StudyArea/SuitableSitesVHR_all.nc'
## using driver `netCDF'
## Simple feature collection with 870258 features and 16 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3810400 ymin: 2232400 xmax: 4871600 ymax: 3074400
## Projected CRS: ETRS_1989_LAEA
comb
Once all the polygons are associated to a scoring the complete data set needs to be reduced to the final 150 test sites.
A first step is to reduce the tiles to the 150 with the best / most suitable test sites. This is done by sorting and reducing them:
comb.bytile<-comb$inTile %>%
table %>%
as_tibble() %>%
setNames(c("ID","Polygons")) %>%
mutate(ID=as.numeric(ID))
polybytile<-left_join(eusalpbb.sites,comb.bytile) %>%
na.omit %>%
arrange(desc(Polygons))
## Joining, by = "ID"
selectedTiles<-polybytile[1:150,]
mapview(selectedTiles)
Now the overall potential test sites are reduced and sorted by the scores.
all2<-list()
for(i in 1:nrow(selectedTiles)){
sel<-comb %>% filter(inTile==selectedTiles$ID[[i]]) %>% arrange(desc(ScoreAll))
sel2<-sel %>% filter(ScoreAll==max(sel$ScoreAll))
all2[[i]]<-sel2
}
comb2<-sf::st_as_sf(data.table::rbindlist(all2)) %>% arrange(inTile)
Finally, the test sites must be reduced to 150 for the VHR data as
agreed in the KO meeting. This is the actual extent for which we were
capable to obtain data. As they should be spread throughout the study
site one Polygon is extracted for each of the previously defined tiles
selectedTiles. In order to make the selection reproducible
the set_seed function - a random Number generator - is
called at the beginning of each iteration
all3<-list()
for(i in 1:nrow(selectedTiles)){
set.seed(i) # The seed for the random number generator
tosample <-comb2 %>% filter(inTile==selectedTiles$ID[[i]]) # Get one Tile
smp <-sample(1:nrow(tosample),1) # Get one random sample per site
issample <-tosample[smp,] # Sample one Polygon
all3[[i]]<-issample # Attach the Polygon to a list
}
comb3<-sf::st_as_sf(data.table::rbindlist(all3)) # reduce the list to a single shapefile
The comb3 dataset represents the final selected Polygons
per Tile as base for the Planetscope data request. This request has been
performed by Sinergise and the corresponding collection is hosted on the
VITO back-end and visible in the openEO
Platform Collectons.
The simple features vector file for the final selection is in the SRR-3 Github repository of the study sites can be seen here:
final_sites<-st_read("resources/UC8/Local/SuitableSitesVHR_selected_country.nc")
## Reading layer `SuitableSitesVHR_selected_country' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/SRR3_notebooks/notebooks/resources/UC8/Local/SuitableSitesVHR_selected_country.nc'
## using driver `netCDF'
## Simple feature collection with 150 features and 17 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3831600 ymin: 2237600 xmax: 4852800 ymax: 3053200
## Projected CRS: ETRS_1989_LAEA
final_sites
and a graphic representation of the test sites here. For better plotting purposes, all datasets are converted to Lat/Lon WGS84:
final_sites_center <- st_centroid(final_sites) %>% st_transform(4326)
## Warning in st_centroid.sf(final_sites): st_centroid assumes attributes are
## constant over geometries of x
mapview(final_sites_center,layer.name="Selected Test Sites",col.regions="red",alpha=0)+
mapview(selectedTiles,layer.name="Selected Subtiles")+
mapview(eusalpbb,layer.name="EUSALP BoundingBox",alpha.regions=0.1,color="green",col.regions="green")
Based on the calculations that we have done in the test site retrieval we also extracted the validation sites at a very early stage of the use case. It was important, however to not include the test sites as validation sites. Therefore the search of the validation data is performed surrounding these sites. As input we used the FCC target variable NetCDF files as uploaded in the SRR3 Notebooks.
For the derivation of the validation data we need the following dataset / information:
fls=list.files("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/SRR3_notebooks/notebooks/resources/UC8/vector_data/target_canopy_cover_60m_equi7/",full.names=T) # List of the resampled target variable
res=60 # Target resolution
Now the shapefiles surrounding the test sites are defined. We sample
points on a regular 60m grid (as the target variable resolution)
surround the study target variable Pixel. From the resulting potential
points we select 20 with a random number generator and combine the
result in the shp.dat spatial file. The grids for each
study site are stored in the grid.dat object.
shapelist= list()
gridlist = list()
for(i in 1:length(fls)){
shp1 <-st_read(fls[i],quiet=T) %>% st_transform(crslaea)
shp1c <-st_centroid(shp1)
# Create a sequence of coordinates outside of the actual study area
tseq <- c(seq(1000,res*1000,res*10),seq(-1000,-(res*1000),-(res*10)))
grid <- expand_grid(X=st_coordinates(shp1c[1,])[1]+tseq,Y=st_coordinates(shp1c[1,])[2]+tseq)
shp2 <- st_as_sf(grid,coords=c("X","Y"),crs=st_crs(shp1)) %>% mutate(ID=i)
gridlist[[i]]<-grid
# Search in case any of the Shapes is outside the bounding Box
wh<-which(st_within(shp2,eusalpbb.laea,sparse = F)==T)
shp.within<-shp2[wh,]
# Randomly sample 20 sites. The set.seed determines the Random Number generator for reproducability
set.seed(10)
smp <- sample(nrow(shp.within),20)
# Extract the 20 sites and create a square buffer around them (representing the test site Polygons)
shp.smp <-shp.within[smp,]
result <-st_buffer(shp.smp,60/2,endCapStyle="SQUARE",nQuadSegs=1)
shapelist[[i]]<-result
}
shp.dat=do.call(bind_rows,shapelist)
grid.dat=do.call(bind_rows,shapelist)
A zoomed in plot of the study site shows that the grid is not interferring with the study site and therefore offers validation sites not touched by the training of the model.
shp.dat<-st_read("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/SRR3_notebooks/notebooks/extdata/ValPol_UC8_raw.nc")
## Reading layer `ValPol_UC8_raw' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/SRR3_notebooks/notebooks/extdata/ValPol_UC8_raw.nc'
## using driver `netCDF'
## Simple feature collection with 2940 features and 0 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 3865542 ymin: 2282032 xmax: 4819832 ymax: 3023525
## Projected CRS: ETRS_1989_LAEA
grid.dat<-st_read("/mnt/CEPH_PROJECTS/SAO/ForestCanopy/SRR3_notebooks/notebooks/extdata/ValGrid_UC8_raw1.nc")
## Reading layer `ValGrid_UC8_raw1' from data source
## `/mnt/CEPH_PROJECTS/sao/ForestCanopy/SRR3_notebooks/notebooks/extdata/ValGrid_UC8_raw1.nc'
## using driver `netCDF'
## Simple feature collection with 39204 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 4103802 ymin: 2964278 xmax: 4223402 ymax: 3083878
## Projected CRS: ETRS89-extended / LAEA Europe
pnt = shp.dat[1,] %>% st_transform(4326)
grid = grid.dat %>% st_transform(4326)
mv <- mapview(grid,layer.names="Potential Validation Sites")+
mapview(pnt,names="Study site Polygons", col.regions="red")
pnt.coords = pnt %>% st_bbox %>% st_as_sfc %>% st_centroid %>% st_coordinates
mv@map %>% setView(pnt.coords[1], pnt.coords[2], zoom = 12)